Abstract. In this paper, we propose a new approach to building synchronization primitives,
dubbed "lwlocks" (short for light-weight locks). The primitives are optimized for a small
memory footprint while maintaining efficient performance in low-contention scenarios. A
read-write lwlock occupies 4 bytes, a mutex occupies 4 bytes (2 if deadlock detection is not
required), and a condition variable occupies 4 bytes. The corresponding primitives of the
popular pthread library occupy 56 bytes, 40 bytes and 48 bytes respectively on the x86-64
platform. The API for lwlocks is similar to that of the pthread library but covers only the
most common use cases. Lwlocks allow explicit control of queuing and scheduling decisions
in contention situations and support "asynchronous" or "deferred blocking" acquisition of
locks. Asynchronous locking helps in working around the constraints of lock ordering, which
otherwise limit concurrency. The small footprint of lwlocks enables the construction of data
structures with very fine-grained locking, which in turn is crucial for lowering contention
and supporting highly concurrent access to a data structure. Currently, the Data Domain
File System uses lwlocks for its in-memory inode cache as well as in a generic doubly-linked
concurrent list which forms the building block for more sophisticated structures.



1 Introduction 

The advent of multi-core systems has forced a rethinking of basic data structures in order to
support greater scalability and concurrency [11]. While there have been good strides in building
lock-free versions of certain data structures [2, 5], and software transactional memory (STM)
based techniques are becoming popular [9, 10], the use of traditional locking techniques remains
the de facto standard for synchronization in shared-memory systems. The usual technique for
increasing concurrency using traditional locking schemes, aside from using algorithms that reduce
the concurrent sections [4, 6], is to use different locks for different parts of the data structures.
The use of such fine-grained locking often runs afoul of the overhead involved, thereby limiting 
the maximum number of locks used. To minimize the space overhead, the algorithms usually try 
to minimize the number of locks, and in turn need to build a mapping to and from different parts 
of the structure to the corresponding lock. This adds to the complexity of the code that needs to 
be maintained. 

In this paper we present a novel technique to create locking primitives that have a very small
memory footprint. We call our locks "light-weight locks" or "lwlocks". Specifically, a read-write
lock in our scheme takes 4 bytes, a mutex takes 4 bytes (only 2 if deadlock detection is not
required), and a condition variable takes 4 bytes. The corresponding primitives of the popular
pthread library occupy 56 bytes, 40 bytes and 48 bytes respectively on the x86-64 platform. The
API for lwlocks is modeled after that of the pthread library. We, however, eschew some of the
features provided by pthread locks for the sake of simplicity of our implementation.

We consider our contributions as being four-fold: (i) locking primitives with a small memory
footprint, which makes them ideal for very fine-grained locking; (ii) the mechanism underlying the
implementation of lwlocks that allows creation of custom lock-like primitives; (iii) access to the
waiting queue of threads so custom scheduling schemes can be implemented; and (iv) support for
"asynchronous" or "deferred blocking" locking.

In this paper, we focus largely on lwlocks. The rest of the paper is organized as follows: Section 2
describes the idea that forms the basis of lwlocks. Section 3 describes the internal structure
of the supported primitives and the algorithms for implementing their APIs. Section 4 briefly
describes possible extensions to lwlocks and how asynchronous locking works. Section 5 compares
the performance of lwlocks with the corresponding primitives in the pthread library. Finally, in
Section 6 we present our conclusions.

2 The Fundamental Idea 

The core idea behind lwlocks is the observation that while a thread could block on different
locks or wait on many different condition variables in its lifetime, it can block on only one
lock or condition variable at any given point. With lwlocks, whenever a thread has to block, it
uses a "waiter" structure to do so. In this paper, we use the terms "waiter structure" and
"waiter" interchangeably. Each thread has its own waiter structure and can access it by invoking
the tls_get_waiter function (which returns the pointer to the waiter kept in thread-local
storage).

Figure 1 presents the definition of a waiter structure. For compact representation, we limit the
maximum number of waiter structures to be less than 2^16 so that each structure can also be
uniquely referred to by a 16-bit number. We reserve the value 2^16 - 1 to represent the NULL waiter
structure and denote it by NULLID. We expect the limit on the number of waiters, and hence on the
number of threads, to be large enough for most applications¹.

¹ The limit can be increased for a small increase in the size of the locks, which presumably will be
acceptable for an application that can support so many threads.

    // The following definitions assume
    // that each bool_t takes 4 bytes, each
    // pthread_mutex_t takes 40 bytes, each
    // pthread_cond_t takes 48 bytes, and
    // each function pointer takes 8 bytes.
    interface event_t {
        void signal();
        void wait();
        bool_t poll();
    } // 24 bytes

    interface domain_t {
        waiter_t alloc_waiter();
        void free_waiter(waiter_t waiter);
        waiter_t get_waiter();
        waiter_t id2waiter(uint16_t id);
    } // 32 bytes

    struct waiter_t {
        event_t event;
        domain_t domain;
        bool_t signal_pending;
        bool_t waiter_waiting;
        pthread_mutex_t mutex;
        pthread_cond_t cond;
        uint64_t app_data;
        uint16_t id;
        uint16_t next;
        uint16_t prev;
    } // 166 bytes

Fig. 1: Definition of a waiter structure.

A waiter structure is assigned to a thread the first time the thread accesses it (via tls_get_waiter)
and the structure is returned to the pool of free waiter structures when the thread exits, to be
re-used by a later thread. The waiter structure is the key piece that enables the compact nature
of lwlocks. It can also be used to create other custom compact lock-like data structures. The
current non-optimized implementation of a waiter structure occupies 166 bytes. Since this cost is
per thread and we expect the normal use case to have far fewer threads than the number of locks,
the amortized cost is very low. For example, an application with 1,024 threads and around 4,725
mutexes will have the same memory footprint whether using lwlocks or pthread mutexes. Our
normal expected use case is for applications that need several hundred thousand or millions of
locks, which would make lwlocks the clear choice for locking. We now describe the most important
pieces that form a waiter structure.
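
The break-even point quoted above can be checked with a little arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the sizes are the ones stated in the text (166 bytes per waiter, 4 bytes per lwmutex, 40 bytes per pthread mutex). With 1,024 threads it yields 4,722 locks, in line with the "around 4,725" figure.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes quoted in the text for the x86-64 platform. */
enum { WAITER_BYTES = 166, LWMUTEX_BYTES = 4, PTHREAD_MUTEX_BYTES = 40 };

/* Total bytes when every thread owns one waiter and every lock is an lwmutex. */
static size_t lwlock_footprint(size_t threads, size_t locks) {
    return threads * WAITER_BYTES + locks * LWMUTEX_BYTES;
}

/* Total bytes when every lock is a pthread mutex. */
static size_t pthread_footprint(size_t locks) {
    return locks * PTHREAD_MUTEX_BYTES;
}

/* Smallest lock count at which lwmutexes use no more memory than pthread mutexes. */
static size_t break_even_locks(size_t threads) {
    size_t saved_per_lock = PTHREAD_MUTEX_BYTES - LWMUTEX_BYTES;  /* 36 bytes */
    assert(saved_per_lock > 0);
    size_t waiter_cost = threads * WAITER_BYTES;
    return (waiter_cost + saved_per_lock - 1) / saved_per_lock;   /* ceiling division */
}
```

Beyond the break-even point every additional lock saves 36 bytes, so at a million locks the per-thread waiter cost is negligible.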

Waiter's Event. A generic event interface underlies the actual mechanics that are used by a 
thread when it blocks or unblocks on a lock. The two main operations defined for an event are: (i) 
wait, which is called to wait for the event to trigger; and (ii) signal, which informs a waiter of 
an event getting triggered. A waiter structure uses one pthread mutex and one condition variable 
to implement both operations. The operation wait blocks the thread on the condition variable 
until a signal arrives. The operation signal wakes up the blocked thread. Like semaphores, the 
implementation ensures that a signal on an event cannot be lost, i.e., a signal can be invoked 
before the matching wait is and the wait will find the pending signal. Unlike a semaphore, 
however, the operations wait and signal are always called in pairs. There is also a third operation 
called poll. It can be used to check if a signal is already pending. 
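
A minimal sketch of such an event, assuming the signal_pending and waiter_waiting flags of Figure 1 are used the way their names suggest. The function names (event_init, event_signal, event_wait, event_poll) are hypothetical and not taken from the actual implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-alone event; the field names mirror Fig. 1. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool signal_pending;   /* a signal arrived and has not been consumed yet */
    bool waiter_waiting;   /* the owning thread is blocked in event_wait */
} event_t;

static void event_init(event_t *e) {
    assert(e != NULL);
    pthread_mutex_init(&e->mutex, NULL);
    pthread_cond_init(&e->cond, NULL);
    e->signal_pending = false;
    e->waiter_waiting = false;
}

/* Record the signal; wake the owner only if it is already blocked. */
static void event_signal(event_t *e) {
    pthread_mutex_lock(&e->mutex);
    e->signal_pending = true;
    if (e->waiter_waiting)
        pthread_cond_signal(&e->cond);
    pthread_mutex_unlock(&e->mutex);
}

/* Block until a signal is pending, then consume it. */
static void event_wait(event_t *e) {
    pthread_mutex_lock(&e->mutex);
    e->waiter_waiting = true;
    while (!e->signal_pending)               /* loop guards against spurious wake-ups */
        pthread_cond_wait(&e->cond, &e->mutex);
    e->waiter_waiting = false;
    e->signal_pending = false;
    pthread_mutex_unlock(&e->mutex);
}

/* Non-blocking check for a pending signal; does not consume it. */
static bool event_poll(event_t *e) {
    pthread_mutex_lock(&e->mutex);
    bool pending = e->signal_pending;
    pthread_mutex_unlock(&e->mutex);
    return pending;
}
```

Because the pending flag is recorded under the mutex, a signal issued before the matching wait is not lost: the later wait finds the flag set and returns without blocking.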

Waiter's Domain. Instead of a fixed implementation for mapping from an id to the waiter 
structure, we have abstracted out the notion of a waiter's domain. A waiter's domain defines four 
operations: (i) alloc_waiter to allocate a waiter from the domain; (ii) free_waiter to return a
waiter back to the domain; (iii) get_waiter which allows a thread to get to its own waiter; and 
(iv) id2waiter to map from an id to the waiter structure. 

Abstracting the notion of a domain has three benefits. First it provides one more way of extending 
the system so that instead of an entire application being limited to a maximum of 2^16 threads, the
limit only applies to individual domains. Second it provides the flexibility to create domains that 
have lower limit on maximum concurrency, thereby allowing for creation of locks with even smaller 
footprint. For example, a system limiting itself to 15 threads (127 without deadlock detection) 
would need only 1 byte for a mutex. Third, combining it with a custom event, allows for creation 
of libraries such as a user space job scheduler. The wait call on a job blocks it and causes the 
scheduler to switch to another ready job while the signal call marks the job as ready again. We 
mention the waiter's domains only for completeness as they are not necessary to understand the 
workings of lwlocks. Lwlocks use a default global domain whose waiter structures implement the
behavior we describe here. 

Forming Lists or Stacks of Waiters. Each waiter structure records its own id. It also has 
space for previous and next id values which can be used to form stacks or lists of waiters. Such a 
list (or stack) of waiters can be identified purely by the id of the first element of the list, i.e., it
can be represented by a 16-bit value. To go to the next (previous) waiter, we convert the current 
id to the corresponding waiter structure and look at the next (previous) id field in it. 

Locking Data. The final important piece of a waiter structure is the space it provides that
can be used by the abstractions built on top for their own purposes. The waiter itself does not
interpret it in any way. For instance, read-write lwlocks use this space to record the type of locking
operation that the thread was performing when it blocked: whether it was taking a read or write
lock. Currently, this space amounts to 8 bytes and is referred to as app_data.



3 Light-Weight Lock Primitives



We now look at the internals of each of the lwlock primitives, the supported operations and how
they work. The lwlocks by default are "fair": a lock is acquired in FIFO order by the threads
blocked on it and wake-ups from a condition variable are done in the order in which the threads
called wait on the condition variable. Pthread locks are not fair in this sense, and although it is
possible to build lwlocks to mimic the same behavior, we have found fairness to be better suited
to our needs in the Data Domain File System [12].

Each primitive uses 2 bytes to keep a queue of waiter structures of the threads that are blocked 
on that primitive. This queue is aptly called a waitq. The waitq is maintained as a "reverse list" 
as that allows insertion of a new waiter in a single hardware supported compare-and-swap (CAS) 
instruction. The next field of a waiter structure holds the id of the waiter structure in front of it. 
The oldest waiter's next field holds NULLID. The oldest waiter is the waiter in front of the waitq. 
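
The single-CAS insertion into the reverse list can be sketched with C11 atomics. This is an illustration under stated assumptions: the fixed-size waiters table and the names waitq_push and waitq_oldest are hypothetical stand-ins for the domain's id2waiter mapping and are not part of the published API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define NULLID      UINT16_MAX   /* 2^16 - 1, the reserved NULL waiter id */
#define MAX_WAITERS 8            /* tiny table, for illustration only */

typedef struct { uint16_t id, next; } waiter_t;

static waiter_t waiters[MAX_WAITERS];        /* hypothetical id -> waiter table */

static waiter_t *id2waiter(uint16_t id) {
    return id == NULLID ? NULL : &waiters[id];
}

/* Append waiter `id`: point it at the current tail, then swing the tail with one CAS. */
static void waitq_push(_Atomic uint16_t *waitq, uint16_t id) {
    assert(id < MAX_WAITERS);
    waiter_t *w = &waiters[id];
    w->id = id;
    uint16_t old = atomic_load(waitq);
    do {
        w->next = old;                        /* waiter in front of us, or NULLID */
    } while (!atomic_compare_exchange_weak(waitq, &old, id));
}

/* Walk from the newest waiter to the oldest one (whose next field is NULLID). */
static uint16_t waitq_oldest(uint16_t tail) {
    uint16_t id = tail;
    while (id != NULLID && id2waiter(id)->next != NULLID)
        id = id2waiter(id)->next;
    return id;
}
```

Insertion never touches any waiter other than the caller's own, which is why it needs only one CAS on the 16-bit queue word; the cost of the reverse layout is the walk that the unlocking side performs to reach the front of the queue.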

To acquire a lock, a thread uses the CAS instruction to either take ownership of the lock or add its 
own waiter structure to the lock's waitq. If the lock is acquired, nothing more needs to be done. 
If it cannot be acquired, then the thread waits on its waiter structure (by calling the event's wait 
routine). When the thread's turn comes to own the lock (in FIFO order), the unlocking thread 
will transfer the lock to it and invoke the event's signal routine on the waiter to wake up the 
thread. Since the unlocking thread does the work of transferring the lock state and ownership, 
the waking thread can assume that it has the lock upon being signaled². The unlocking thread
has to walk the waitq to find the waiter to signal. At any point there can be only one thread 
performing the transfer on a lock and hence the walk is safe to perform. 

We now present each one of the lwlock primitives. Note that we only highlight the essence of
the various operations in the included algorithms. The actual implementation, which we hope to 
release to the open source community in the near future, has additional logic for performance 
optimization. 

Light-weight mutex. The 4 bytes of a light-weight mutex (henceforth a lwmutex) are composed
of 2 equal parts. The first part holds the id of the waiter structure of the owner thread and the
second part is the waitq. The owner id is necessary to do self-deadlock detection. Figure 2 outlines
the lock and unlock algorithms for the 4-byte version of a lwmutex. If deadlock detection is not
required, the lock only needs to be 16 bits in size to hold the waitq. To comply with POSIX
semantics, we also need to be able to ascertain the owner of such a mutex. Fortunately, we can
use the same waitq space. The locking thread swaps the NULLID of the waitq with the id of its
own waiter to indicate that the lock is taken. As other threads block, their waiter structures get
added to the waitq as in the case of the regular lwmutex. The difference is that the next field of
the waiter structure in front of the waitq does not hold NULLID. Instead, it holds the id of the 
waiter of the lock owner thread. Hence, the unlock operation traverses the waitq until a waiter 
whose next field matches the id of the unlocking thread's waiter structure is reached. The next 
field is reset to NULLID and the waiter is signaled. 
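
One way to read the 2-byte scheme just described is sketched below. It is an interpretation, not the authors' code: the next_of table and the function names are hypothetical, blocking and signaling through the waiter's event are left out so that only the queue manipulation is shown, and the observation that the hand-off needs no CAS on the lock word is an inference from the encoding rather than something the text states.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NULLID      UINT16_MAX
#define MAX_WAITERS 8                          /* tiny table, for illustration only */

static _Atomic uint16_t next_of[MAX_WAITERS];  /* hypothetical: next field of waiter i */

typedef _Atomic uint16_t lwmutex2_t;           /* the whole lock is just the waitq */

/* Returns true if `me` now owns the lock, false if `me` was queued
 * (the caller would then block in its event's wait routine). */
static bool lwmutex2_lock_or_enqueue(lwmutex2_t *m, uint16_t me) {
    assert(me < MAX_WAITERS);
    uint16_t old = atomic_load(m);
    for (;;) {
        /* Free lock: old is NULLID. Held lock: old is the waiter in front of us,
         * or the owner's id if we are the first one to queue. */
        atomic_store(&next_of[me], old);
        if (atomic_compare_exchange_weak(m, &old, me))
            return old == NULLID;
    }
}

/* Returns the id of the waiter that received the lock (the caller would signal it),
 * or NULLID if the lock became free. */
static uint16_t lwmutex2_unlock(lwmutex2_t *m, uint16_t me) {
    uint16_t expected = me;
    /* No waiters: the waitq still holds only our id, so swap it back to NULLID. */
    if (atomic_compare_exchange_strong(m, &expected, NULLID))
        return NULLID;
    /* Otherwise walk from the newest waiter to the one queued directly behind us. */
    uint16_t id = expected;                    /* tail observed by the failed CAS */
    while (atomic_load(&next_of[id]) != me)
        id = atomic_load(&next_of[id]);
    /* `id` becomes the owner. Any waiter queued behind it already points at `id`,
     * which is exactly the "next holds the owner id" encoding. */
    atomic_store(&next_of[id], NULLID);
    return id;
}
```

In this reading, a lock word equal to some id with no chain behind it means "held by that waiter, nobody queued", so the same 16 bits serve as both owner field and queue tail.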

Light-weight condition variable. The 4 bytes of a light-weight condition variable (henceforth
a lwcondvar) are composed of 2 equal parts. The first part is a 2-byte version of lwmutex and the
second part is the queue of waiter structures. There are three basic operations for a lwcondvar:
(i) wait; (ii) signal; and (iii) broadcast. The internal 2-byte lwmutex is used to synchronize
manipulation of the waiter's queue, which makes the algorithms for those three operations very
easy to derive. The algorithms for the three operations are presented in Appendix A.

² For unfair locks, this part has to change and the waking thread would need to try again to take the
lock.

    struct lwmutex_t {
        uint16_t owner;  // owner ID
        uint16_t waitq;  // tail of queue
    }

    void lock(lwmutex_t m) {
        w = tls_get_waiter();
        do {
            n = o = m;
            if (o.owner == NULLID)
                n.owner = w.id;
            else {
                w.next = n.waitq;
                n.waitq = w.id;
            }
        } while (!CAS(m, o, n));
        if (n.owner == w.id)
            return;  // Got lock
        // Wait for lock transfer
        w.event.wait();
    }

    void unlock(lwmutex_t m) {
        w = tls_get_waiter();
        do {
            n = o = m;
            wtw = pw = NULL;
            if (o.waitq != NULLID) {
                // Walk to the oldest waiter
                wtw = id2waiter(o.waitq);
                while (wtw.next != NULLID) {
                    pw = wtw;
                    wtw = id2waiter(wtw.next);
                }
                if (pw == NULL) n.waitq = NULLID;
                // Transfer lock to wtw
                n.owner = wtw.id;
            } else n.owner = NULLID;
        } while (!CAS(m, o, n));
        // remove wtw from q
        if (pw != NULL)
            pw.next = NULLID;
        if (wtw) wtw.event.signal();
    }

Fig. 2: Operations to lock and unlock a lwmutex. The old and new values passed in to CAS are
denoted by o and n, respectively. The caller's thread local waiter structure is denoted by w. We use
wtw and pw to denote the waiter to wake up and the previous waiter in the queue, respectively.

Light-weight read-write lock. The light-weight read-write lock (henceforth a lw_rwlock) also
uses 2 of its 4 bytes for the waitq. Of the remaining 16 bits, 14 bits are used for the count of 
read locks granted, 1 bit is used to indicate a write lock, and the final 1 bit is used to indicate 
whether the lock is read-biased or not. A read-biased lock is unfair towards writers in the sense 
that a thread that needs a read lock will acquire it without any regard to waiting writers if the 
lock is already held by other readers. This behavior is similar to that of pthread read-write lock 
and is essential for applications where a thread can recursively acquire the same lock as a reader. 
Without the read-biased behavior, a deadlock can result if a writer arrives in between two read 
lock acquisitions: the second read lock attempt will wait for the writer which is waiting for the 
first read lock to be released. Applications that do not have recursive read locking do not need 
the read-biased behavior but may choose to use it for throughput reasons. 

The 14-bit reader count limits the maximum number of readers per lock to 2^14 - 1 (16,383), a
limit that we have found to be sufficient in practice. The limit can be raised in three ways: by
having the API explicitly flag read-bias behavior, so the bias bit does not have to be in the
lw_rwlock; by restricting the maximum concurrency, thereby freeing bits from the waitq; or by
slightly increasing the size of the lock.
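
To make the 32-bit layout concrete, the sketch below packs and unpacks the four fields. The bit positions are an assumption chosen for illustration (the text fixes only the field widths), and rw_fields_t, rw_pack and rw_unpack are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NULLID       UINT16_MAX
#define MAX_READERS  ((1u << 14) - 1)      /* largest value a 14-bit count can hold */

/* Assumed layout: bits 0-15 waitq, bits 16-29 readers, bit 30 wlocked, bit 31 rd_bias. */
typedef struct {
    bool     rd_bias;
    bool     wlocked;
    uint16_t readers;
    uint16_t waitq;
} rw_fields_t;

/* Fold the four fields into the single word that a 32-bit CAS operates on. */
static uint32_t rw_pack(rw_fields_t f) {
    assert(f.readers <= MAX_READERS);
    return (uint32_t)f.waitq
         | ((uint32_t)f.readers << 16)
         | ((uint32_t)f.wlocked << 30)
         | ((uint32_t)f.rd_bias << 31);
}

/* Split a lock word back into its fields. */
static rw_fields_t rw_unpack(uint32_t word) {
    rw_fields_t f;
    f.waitq   = (uint16_t)(word & 0xFFFFu);
    f.readers = (uint16_t)((word >> 16) & MAX_READERS);
    f.wlocked = ((word >> 30) & 1u) != 0;
    f.rd_bias = ((word >> 31) & 1u) != 0;
    return f;
}
```

Keeping all four fields in one word is what lets a single CAS update the reader count, the write flag and the waitq together.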

Figure 3 outlines the algorithms for the two main operations: (i) lock, and (ii) unlock. The lock
operation on lw_rwlock is similar to that of lwmutex with the added flag indicating if a read or
write lock is requested. The unlock operation for a non-read-biased lw_rwlock has to pick the oldest
set of waiters that it can signal: either a single writer or a set of contiguous readers. A read-biased
lw_rwlock can follow the same logic as a non-biased lw_rwlock when the transfer is to a waiting
writer. For the transfer from a writer to reader(s), however, the writer has to signal all readers,
not just the oldest contiguous set. The solution is to have the writer atomically remove the entire
waitq and downgrade to a read lock. It then separates the waitq into two queues: one consisting
of readers and one consisting of writers. The writers are added back to the front of the waitq
while also updating the reader count to fully account for the readers found in the removed waitq.
Finally, the readers can be signaled. Note that the re-insert of the waiting writers during unlock
is safe. The re-insert is done at the front of the waitq and any new writers will add themselves
to the back of the waitq. No other thread can be traversing the waitq for ownership transfer
as the re-inserting thread holds a read lock. This case makes the implementation of lw_rwlocks
the most complex of all the primitives, and the algorithm outline is only at a high level for the
contention case, where the waitq has at least one waiter in it. The uncontended case is simple
to derive.
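
Figure 3 relies on a helper, find_oldest_set_of_waiters, whose body is not shown. A possible sketch of that selection step follows. It is a guess at the logic based on the description above, with a hypothetical fixed-size waiters table and an exclusive flag standing in for app_data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NULLID      UINT16_MAX
#define MAX_WAITERS 8                        /* tiny table, for illustration only */

typedef struct {
    uint16_t next;        /* waiter in front of this one, NULLID for the oldest */
    bool     exclusive;   /* stands in for app_data: true if a write lock is wanted */
} waiter_t;

static waiter_t waiters[MAX_WAITERS];        /* hypothetical id -> waiter table */

/* Walks the waitq from its tail (newest) to its front (oldest).
 * On return *wtw is the newest member of the oldest wake-up set (a single writer
 * or a run of contiguous readers) and *pw is the waiter queued right behind that
 * set, or NULLID if the set is the entire waitq. */
static void find_oldest_set(uint16_t tail, uint16_t *pw, uint16_t *wtw) {
    assert(tail == NULLID || tail < MAX_WAITERS);
    uint16_t newer = NULLID;
    bool newer_is_reader = false;
    *pw = *wtw = NULLID;
    for (uint16_t id = tail; id != NULLID; id = waiters[id].next) {
        bool reader = !waiters[id].exclusive;
        /* A writer always starts a new set; a reader extends the current set
         * only if the waiter queued right behind it is also a reader. */
        if (!reader || !newer_is_reader) {
            *wtw = id;
            *pw = newer;
        }
        newer = id;
        newer_is_reader = reader;
    }
}
```

Following the next chain from the returned wtw reaches exactly the waiters to wake, which matches how waitq_size(wtw.id) is used in Figure 3 to set the new reader count.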



    struct lw_rwlock_t {
        uint1_t  rd_bias;
        uint1_t  wlocked;
        uint14_t readers;
        uint16_t waitq;
    }

    void lock(lw_rwlock_t l, bool_t exclusive) {
        w = tls_get_waiter();
        do {
            o = n = l;
            if (!exclusive && !o.wlocked &&
                (o.waitq == NULLID ||
                 o.rd_bias))
                n.readers++;
            else if (exclusive &&
                     !(n.wlocked || n.readers > 0))
                n.wlocked = 1;
            else {  // Need to block
                w.app_data = exclusive;
                w.next = o.waitq;
                n.waitq = w.id;
            }
        } while (!CAS(l, o, n));
        if (n.waitq == w.id) w.event.wait();
    }

    void unlock_fair(lw_rwlock_t l) {
        do {
            o = n = l;
            if (n.wlocked == 1) n.wlocked = 0;
            else n.readers--;
            if (!(n.wlocked || n.readers > 0)) {
                (pw, wtw) = find_oldest_set_of_waiters(n);
                if (pw == NULL) n.waitq = NULLID;
                if (wtw.app_data != exclusive) {
                    n.readers = waitq_size(wtw.id);
                } else  // single writer picked
                    n.wlocked = 1;
            }
        } while (!CAS(l, o, n));
        if (pw != NULL) pw.next = NULLID;
        wake_up_waiters(wtw);
    }

    uint16_t waitq_size(uint16_t wid) {
        uint16_t count = 0;
        while (wid != NULLID) {
            wid = id2waiter(wid).next;
            count++;
        }
        return count;
    }

    void unlock(lw_rwlock_t l) {
        if (!l.rd_bias)
            return unlock_fair(l);
        if (l.wlocked == 0) {
            do {  // Only writers in waitq
                o = n = l;
                if (n.readers == 1) {
                    n.wlocked = 1; n.readers = 0;
                } else n.readers--;
            } while (!CAS(l, o, n));
            if (n.wlocked) unlock_fair(l);
            return;
        }
        // writer unlocking a biased lock
        ow = find_oldest_waiter(l);
        if (ow.app_data == exclusive) {
            // handing off to writer
            unlock_fair(l); return;
        }
        // Wake up all readers. Atomically
        // downgrade to read lock and
        // clear & return waitq. After the
        // downgrade only writers can block on l.
        waitq = downgrade_to_read_lock(l);
        // split waitq in two subqueues:
        // readers queue and writers queue
        (rd_q, wr_q) = splitq(waitq);
        do {
            o = n = l;
            // 1 reader added during downgrade.
            n.readers += waitq_size(rd_q) - 1;
            if (n.waitq == NULLID) n.waitq = wr_q;
        } while (!CAS(l, o, n));
        if (n.waitq != wr_q && wr_q != NULLID) {
            ow = find_oldest_waiter(n);
            ow.next = wr_q;
        }
        wake_up_waiters(id2waiter(rd_q));
    }

Fig. 3: Operations to lock and unlock a lw_rwlock. The lock operation takes a boolean as input 
to indicate whether an exclusive lock is requested. The old and new values passed in to CAS are 
denoted by o and n, respectively. The caller's thread local waiter structure is denoted by w. We 
use wtw, pw and ow to denote the waiter to wake up, the previous waiter in the waitq, and the 
oldest waiter in the waitq, respectively. 



4 Asynchronous Locking and Other Extensions 



We take a moment here to highlight some aspects of the algorithms presented in Section 3 and
how small changes would enable alternative behaviors. On the locking side, the key observation is
that once a thread has put itself on the wait queue of a lock or condition variable, it is guaranteed
to have the lock transferred to it or a signal delivered to it. The thread does not have to call
wait right away. The thread could spin for a certain amount of time on poll before calling wait,
effectively creating adaptive locks. It could also keep spinning, which would create starvation-free
spin locks. Both of these are scalable and contention-free, similar to the approaches in [1, 3, 7].

Alternatively, the lock operation could simply return without calling wait at all. This would allow
the calling thread to take some application-specific action before invoking wait. We call this mode
of operation taking an "asynchronous" or "deferred blocking" lock. Asynchronous locking is the
key enabler to work around the constraints that lock-ordering imposes. We use this functionality 
in building a generic highly concurrent doubly-linked list in the Data Domain File System [12]. 
The list allows concurrent appends, dequeues, inserts (before or after any member), deletes and 
iterators (in either direction). Some of these operations need to acquire locks in opposite order 
of other operations. To avoid deadlocks, a canonical order is picked and operations that need to 
acquire locks in the opposite direction use asynchronous locking. 

The following example, taken from the doubly-linked list implementation, illustrates how
asynchronous locking is used and why it is essential. Suppose the canonical order for nodes A and
B is A, then B. A thread holds a lock on B already and needs to lock A. It will make an
asynchronous lock call for A. If the thread is unable to get the lock, it is on A's waitq, and it releases
the lock on B. It then waits for the lock on A to be granted and then reacquires the lock on B
(which is in canonical order). In the above sequence, the thread always either holds a lock (on
A or B) or is in the waitq of a lock (on A). Other guarantees in the data structure assure that 
in this case A and B will remain valid and hence there will be no illegal access. Achieving this 
without asynchronous locking is not possible. Using trylock on A and upon failure, releasing B 
then locking A leaves a window open between release of B and locking of A where neither node is 
in any way aware of the thread. One or both nodes could go away in that window and the thread 
would end up performing an illegal access. 
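
The A-then-B sequence above can be sketched in C on top of a simplified 4-byte lwmutex. Everything named here (lw_lock_async, lw_wait, lw_unlock, lock_a_while_holding_b, the waiters table, explicit waiter ids in place of thread-local storage) is hypothetical and only illustrates the protocol; it is not the Data Domain implementation and omits deadlock detection and the validity guarantees the list provides for A and B.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NULLID      UINT16_MAX
#define MAX_WAITERS 8                       /* tiny table, for illustration only */

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool     signal_pending;
    uint16_t next;
} waiter_t;
static waiter_t waiters[MAX_WAITERS];       /* hypothetical id -> waiter table */

typedef _Atomic uint32_t lwmutex_t;         /* owner in the high 16 bits, waitq in the low 16 */
#define OWNER(v)     ((uint16_t)((v) >> 16))
#define WAITQ(v)     ((uint16_t)((v) & 0xFFFFu))
#define MAKE(o, q)   (((uint32_t)(o) << 16) | (uint32_t)(q))
#define LWMUTEX_INIT MAKE(NULLID, NULLID)

static void waiter_init(uint16_t id) {
    assert(id < MAX_WAITERS);
    pthread_mutex_init(&waiters[id].mutex, NULL);
    pthread_cond_init(&waiters[id].cond, NULL);
    waiters[id].signal_pending = false;
    waiters[id].next = NULLID;
}

/* Deferred-blocking acquire: true means `me` owns the lock now; false means `me`
 * is on the waitq and must call lw_wait(me) before touching the protected data. */
static bool lw_lock_async(lwmutex_t *m, uint16_t me) {
    uint32_t o = atomic_load(m), n;
    do {
        if (OWNER(o) == NULLID) {
            n = MAKE(me, WAITQ(o));
        } else {
            waiters[me].next = WAITQ(o);
            n = MAKE(OWNER(o), me);
        }
    } while (!atomic_compare_exchange_weak(m, &o, n));
    return OWNER(n) == me;
}

/* Block until the unlocking thread has handed the lock over. */
static void lw_wait(uint16_t me) {
    waiter_t *w = &waiters[me];
    pthread_mutex_lock(&w->mutex);
    while (!w->signal_pending)
        pthread_cond_wait(&w->cond, &w->mutex);
    w->signal_pending = false;
    pthread_mutex_unlock(&w->mutex);
}

/* Release the lock, transferring it to the oldest waiter if there is one. */
static void lw_unlock(lwmutex_t *m) {
    uint32_t o = atomic_load(m), n;
    uint16_t wtw, pw;
    do {
        wtw = WAITQ(o);
        pw = NULLID;
        while (wtw != NULLID && waiters[wtw].next != NULLID) {
            pw = wtw;
            wtw = waiters[wtw].next;
        }
        /* If wtw was the only waiter the queue empties; otherwise the tail stays. */
        n = MAKE(wtw, pw == NULLID ? NULLID : WAITQ(o));
    } while (!atomic_compare_exchange_weak(m, &o, n));
    if (pw != NULLID)
        waiters[pw].next = NULLID;          /* unlink the new owner from the queue */
    if (wtw != NULLID) {
        pthread_mutex_lock(&waiters[wtw].mutex);
        waiters[wtw].signal_pending = true;
        pthread_cond_signal(&waiters[wtw].cond);
        pthread_mutex_unlock(&waiters[wtw].mutex);
    }
}

/* The protocol from the text: the caller holds B and needs A (canonical order A, B). */
static void lock_a_while_holding_b(lwmutex_t *a, lwmutex_t *b, uint16_t me) {
    if (lw_lock_async(a, me))
        return;                 /* got A immediately; B is still held */
    lw_unlock(b);               /* already queued on A, so dropping B leaves no gap */
    lw_wait(me);                /* A is handed over in FIFO order */
    if (!lw_lock_async(b, me))  /* reacquire B, now in canonical order */
        lw_wait(me);
}
```

The important property is the ordering inside lock_a_while_holding_b: the thread is already on A's waitq before it lets go of B, so there is no moment at which neither node knows about it.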

We are also working on building highly concurrent versions of other data structures (trees of
various kinds) where we expect to use asynchronous locking frequently. Note that since there is only
one waiter structure per thread per domain, a thread can only be performing one asynchronous
lock operation per domain at any time. To keep the discussion focused on lwlocks, we cannot go
into any more details of our list or other data structures here.

On the unlocking side of the operations, we note that since the waitq management is visible in 
user space, the unlocking thread has a lot of flexibility in picking which of the waiting threads to 
signal and whether to do lock hand-off or have the signaled thread retry. This can be exploited 
to create any custom scheduling policy. We could pick the thread with the highest priority or the
longest waiting thread or even have applications use the app_data to define their own preferences.
Signaling waiters in LIFO instead of FIFO order would trade fairness for performance as we 
illustrate in Section 5. 

Finally, with most hardware supporting 64-bit CAS instructions, the generic building block of a
16-bit waitq leaves 48 bits available for building other primitives. For example, we have built
semaphore-like counters and a combined mutex+condition variable structure, and implemented
upgrade and downgrade operations for lw_rwlocks (see Appendix B for lw_rwlock algorithms
that allow these). Although our implementation has focused on process-private locks, we believe
it is possible to extend the approach to include process-shared locks. For example, the Linux
operating system limits the maximum number of processes to 2^15 by default, which would give a
natural mapping from the process id to the waiter structure id for the process. The structures could
be managed in user space shared memory or the kernel could manage them. An actual
semaphore would be more appropriate in this case to implement the event interface for the
waiters.

5 Performance 

We now examine some experimental data to show that the performance of lwlocks is acceptable.
The experiments were performed on a 4-socket system with Xeon E7-4860 processors. Each socket
has 10 physical (20 with hyper-threading enabled) cores, for a total of 40 (80 hyper-threaded)
cores. The machine has 256GB of memory and each core operates at 2.26GHz.

We have carried out three sets of experiments. Each experiment was run 20 times, which was
enough to get a confidence level of 99% on the presented average values. The first one compares
the performance of unfair lwmutexes with unfair pthread mutexes. Unfair mutexes trade off
fairness for performance by using the greedy approach: the unlocking thread can reacquire the
lock right away again. This is done to avoid the convoy problem. We have implemented two
versions of unfair lwmutexes: (i) LIFO wake-ups, which wakes up the most recent thread in the
waitq; and (ii) FIFO wake-ups, which wakes up the longest-waiting thread in the waitq. The
experiment consists of n threads carrying out the same number of operations on a global
doubly-linked list protected by a single unfair mutex; each operation has the same cost. Each thread
acquires the global mutex, performs an operation and drops the mutex. There is no activity
outside the locked code block except to increment the loop counter.

Figure 4 (a) shows how the latency per operation increases with the number of contending threads.
As the number of threads increases, the per-operation cost goes up for all lock types. Note that, for
relatively low contention (n < 10), unfair lwmutexes perform as well as unfair pthread mutexes³.
We are satisfied that our implementation is reasonably efficient from the performance shown by
unfair lwmutexes. The gap between pthread mutexes and LIFO unfair lwmutexes arises from the fact
that pthread mutexes try the CAS operation only once before making a system call to block. The
lwmutex code (both lock and unlock) has to contend until the caller has performed a successful
CAS operation. The performance gap between the LIFO and FIFO versions of lwmutex highlights the
overhead of traversing the waitq. It is well known that a fair mutex is considerably slower than
an unfair one under high contention due to frequent context switches (the convoy problem). For
32 contending threads, we saw the latency per operation of a fair mutex go as high as 13x the
latency per operation seen for unfair mutexes. However, if there is no contention or just a few
contending threads (< 3), the latency per operation is very close to the one obtained with unfair
mutexes. We note that performance parity with pthread mutexes was never our goal. Although we
believe that with proper tuning the cost difference between lwmutexes and pthread locks can be
reduced further, our primary concern is the memory overhead of pthread locks, which prevents their
use in extremely fine-grained locking. Fine-grained locking results in lower contention in general
and hence improved performance overall, as we show in the next experiment.

The second experiment illustrates how fine-grained locking can deliver better performance overall.
The experiment consists of n threads performing lookups, followed by an update to the looked-up
record, on a hash table. The hash table has 1 million buckets and is populated with 2 million
elements (chaining is used as the collision resolution scheme). We evaluate the latency per
operation (in microseconds) for two cases: (i) a fair lwmutex is embedded in each bucket's list head;
and (ii) 1,024 unfair pthread mutexes are used, where each one protects a range of 1,024 buckets.

³ Our code is written entirely in C and compiled with O4 optimization. Pthread code is part C and part
fine-tuned assembly.

Fig. 4: Latency per operation for: (a) coarse-grained locking; (b) fine-grained locking.

Figure 4 (b) shows how the latency per operation increases as the number of threads concurrently 
operating on the hash table increases. As can be seen, it is preferable to have fine-grained locking 
than optimizing the performance of the lock itself. Also, when the lock is placed within the bucket 
itself, it improves the memory locality and may have fewer cache misses compared to accessing 
a pthread mutex located in a separate memory area. For the hash table case it is very easy to map
from a bucket to a pthread mutex stored in a separate area. That is not true for other data 
structures like linked lists and trees. Additional logic to minimize the number of locks for those 
data structures introduces complexity which is more difficult to maintain than for the case where 
a lock can be cheaply added per node. Even a hash table that uses open-addressing schemes 
(probing, double hashing or cuckoo hashing [8]) for resolving conflicts presents challenges when 
using range locking. 

Finally, the third experiment compares lw_rwlocks with read-write pthread locks. We use the
same hash table as before but now we fix the number of threads (readers + writers) at 34 and
vary the number of writers (or contending threads) from 0 to 34. Beyond 34 threads we
start seeing contention across readers for pthread locks: the contention is on the update of the
reader counter, which is protected by a mutex in the pthread library. Because we only want to
evaluate the contention due to writers, we picked 34.

Figure 5 shows how the latency per operation increases as the number of writers concurrently 
operating on the hash table increases. Once again the fine-grained locking provided by the cheap 
lw_rwlocks delivers better overall performance and also scales better than read-write pthread 
locks. 



6 Conclusions 



We have presented in this paper a new approach to building compact synchronization primitives. 
This is possible because each thread can only block in one lock or condition variable at a time. 

Fig. 5: Latency per operation on a hash table using either a lw_rwlock per bucket or 1,024 
read-write pthread locks, each one protecting a range of 1,024 buckets. The total number of threads 
was fixed to 34 and the number of writers goes from 0 to 34. 

Besides the compact nature of light-weight locks, the queue management of blocked threads is 
also done entirely in user space. This allows the implementation of features that are impossible 
to implement with traditional pthread locks. For instance, asynchronous locking cannot be 
implemented with pthread locks as they stand. The cost of light-weight locks is a 166-byte waiter 
structure per thread, which is amortized very quickly in applications where there are many more 
locks than threads. We believe that this is a fairly common scenario. 
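
The amortization argument can be checked with simple arithmetic. The sketch below is only an illustration: it takes the 4-byte lwmutex and 40-byte pthread mutex sizes from the abstract and the 166-byte waiter figure quoted above at face value, and the function names lw_footprint, pthread_footprint and break_even_locks are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes quoted in the paper for x86-64. */
enum {
    LWMUTEX_BYTES       = 4,    /* lwmutex with deadlock detection */
    PTHREAD_MUTEX_BYTES = 40,   /* pthread mutex */
    WAITER_BYTES        = 166   /* per-thread waiter structure */
};

static_assert(PTHREAD_MUTEX_BYTES > LWMUTEX_BYTES,
              "break-even computation assumes the lwmutex is smaller");

/* Total bytes when every lock is an lwmutex and every thread owns one waiter. */
static size_t lw_footprint(size_t locks, size_t threads) {
    return locks * LWMUTEX_BYTES + threads * WAITER_BYTES;
}

/* Total bytes when every lock is a pthread mutex. */
static size_t pthread_footprint(size_t locks) {
    return locks * PTHREAD_MUTEX_BYTES;
}

/* Smallest lock count at which the lwmutex scheme is strictly smaller. */
static size_t break_even_locks(size_t threads) {
    size_t saved_per_lock = PTHREAD_MUTEX_BYTES - LWMUTEX_BYTES;  /* 36 bytes */
    return (threads * WAITER_BYTES) / saved_per_lock + 1;
}
```

Under these figures, 34 threads pay for their waiter structures once the application holds 157 locks. With one lock per bucket in a table of 1,048,576 buckets, the lwmutex scheme needs about 4.2 MB against roughly 41.9 MB for pthread mutexes.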

A Pseudo code for light-weight condition variables 

Figure 6 presents the structure of a lwcondvar as well as the operations it supports. 



struct lwcondvar_t {
    lwmutex_t m;     // 2-byte lwmutex
    uint16_t waitq;  // queue tail
}

void wait(lwmutex_t m, lwcondvar_t c) {
    w = tls_get_waiter();
    w.next = NULLID;
    lock(c.m);
    if (c.waitq == NULLID)
        c.waitq = w.id;
    else {
        w.next = c.waitq;
        c.waitq = w.id;
    }
    unlock(c.m);
    unlock(m);
    w.event.wait();
    lock(m);
}

void wake_up_waiters(waiter_t w) {
    while (w != NULL) {
        waitq = w.next;
        w.event.signal();
        w = id2waiter(waitq);
    }
}



void signal(lwcondvar_t c) {
    lock(c.m);
    if (c.waitq != NULLID) {
        wtw = id2waiter(c.waitq);
        pw = NULL;
        while (wtw.next != NULLID) {
            pw = wtw;
            wtw = id2waiter(wtw.next);
        }
        if (pw == NULL) c.waitq = NULLID;
        else pw.next = NULLID;
    } else wtw = NULL;
    unlock(c.m);
    if (wtw != NULL) wtw.event.signal();
    // Else missed signal
}

void broadcast(lwcondvar_t c) {
    lock(c.m);
    waitq = c.waitq; // pointer to queue's head
    c.waitq = NULLID;
    unlock(c.m);
    wake_up_waiters(id2waiter(waitq));
}



Fig. 6: Operations defined for a lwcondvar. The old and new values passed in to CAS are denoted 
by o and n, respectively. The caller's thread local waiter structure is denoted by w. We use wtw 
and pw to denote the waiter to wake up and the previous waiter in the queue, respectively. 
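
The queue handling in Figure 6 can be exercised in isolation. The following single-threaded C sketch covers only the linked-list steps of wait, signal and broadcast; the internal lwmutex and the per-waiter events are deliberately left out. The fixed waiter pool, the value chosen for NULLID, and the names condq_t, enqueue, dequeue_oldest and detach_all are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NULLID 0xFFFFu          /* assumed sentinel for "no waiter" */
enum { MAX_WAITERS = 8 };       /* assumed small fixed pool for the sketch */

typedef struct { uint16_t id; uint16_t next; } waiter_t;
typedef struct { uint16_t waitq; } condq_t;   /* id of the newest waiter */

static waiter_t waiters[MAX_WAITERS];

/* Translate a 16-bit waiter id into a pointer; NULLID maps to a null pointer. */
static waiter_t *id2waiter(uint16_t id) {
    return id == NULLID ? 0 : &waiters[id];
}

/* Queue step of wait(): link the caller in as the newest waiter. */
static void enqueue(condq_t *c, uint16_t id) {
    assert(id < MAX_WAITERS);
    waiters[id].id = id;
    waiters[id].next = c->waitq;   /* NULLID when the queue was empty */
    c->waitq = id;
}

/* Queue step of signal(): detach and return the oldest waiter, or NULLID. */
static uint16_t dequeue_oldest(condq_t *c) {
    if (c->waitq == NULLID) return NULLID;       /* missed signal */
    waiter_t *wtw = id2waiter(c->waitq), *pw = 0;
    while (wtw->next != NULLID) {                /* walk to the end of the chain */
        pw = wtw;
        wtw = id2waiter(wtw->next);
    }
    if (pw == 0) c->waitq = NULLID;              /* wtw was the only waiter */
    else pw->next = NULLID;                      /* cut wtw off the chain */
    return wtw->id;
}

/* Queue step of broadcast(): take the whole chain, leaving the queue empty. */
static uint16_t detach_all(condq_t *c) {
    uint16_t first = c->waitq;
    c->waitq = NULLID;
    return first;
}
```

Because waitq records only the newest waiter, signal has to walk the chain to reach the oldest one, so its cost grows with the number of waiters, while wait and broadcast touch the queue in constant time. Storing a single 16-bit id instead of head and tail pointers is what keeps the condition variable at 4 bytes.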



B Upgrading and Downgrading light-weight read-write locks 

As mentioned in Section 4, a lw_rwlock also supports upgrade and downgrade operations. Figure 7 
shows the algorithms for the two operations. Note that even though multiple readers could be 
traversing the waitq during upgrade, the traversal is safe. The waitq changes only due to the arrival 
of new waiters at the back of the queue or the removal of the waiter at the front of the queue during 
lock transfer. The former is immaterial to the traversal, which does not depend on what happens to 
waiters behind it. The latter cannot happen as the traversing thread still holds a read lock. The only 
possible race happens on the next field of the oldest waiter in the waitq: a reader performing 
an upgrade wants to add its own waiter in front of it, and a thread releasing the write lock on a 
reader-biased lock is re-inserting the list of existing waiters. This situation is handled by the upgrade 
logic and by the unlock routine. 

The unlock operation presented in Figure 3 has to be slightly changed to support downgrade and 
upgrade of a lw_rwlock. For the case where a writer is releasing the write lock of a read-biased 
lw_rwlock, while re-inserting the wr_q at the front of the waitq of the lw_rwlock, we have to 
use a CAS instruction to coordinate with a possible upgrader. Also, if an upgrader is found to be 
already present at the front of the waitq, the re-inserted wr_q is added behind the upgrader's 
waiter. 
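
The race on the next field of the oldest waiter comes down to a single compare-and-swap that only one contender can win. The C11 sketch below illustrates that one step under stated assumptions: the next field is a 16-bit atomic, NULLID is 0xFFFF, and claim_front is a hypothetical helper, not a routine from the paper.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NULLID 0xFFFFu   /* assumed sentinel for "no waiter" */

typedef struct {
    _Atomic uint16_t next;   /* id of the waiter queued in front, or NULLID */
} waiter_t;

/* Try to claim the empty slot in front of the oldest waiter. At most one
 * of several racing callers succeeds; the others learn the winner's id. */
static bool claim_front(waiter_t *oldest, uint16_t my_id, uint16_t *winner) {
    assert(my_id != NULLID);
    uint16_t expected = NULLID;
    bool ok = atomic_compare_exchange_strong(&oldest->next, &expected, my_id);
    /* On failure, expected has been overwritten with the current value. */
    *winner = ok ? my_id : expected;
    return ok;
}
```

A caller that loses the exchange can inspect the winner to decide what to do next: an upgrader that lost to another upgrader gives up, while a thread that lost to a re-insert retries against the new oldest waiter.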



void downgrade(lw_rwlock_t l) {
    do {
        o = n = l;
        if (o.waitq != NULLID) {
            // Have existing waiters. Can't
            // do direct downgrade.
            break;
        }
        n.wlocked = 0;
        n.readers = 1;
    } while (!CAS(l, o, n));
    if (n.readers != 1) {
        w = tls_get_waiter();
        // Indicate that w will now wait
        // for a read lock.
        w.app_data = SHARED;
        insert_waiter_at_front(l.waitq, w);
        // Unlock will grant reader lock and
        // waiter will get a pending signal.
        unlock(l);
        // Consuming the pending signal
        w.event.wait();
    }
}

bool_t upgrade(lw_rwlock_t l) {
    do {
        o = n = l;
        if (o.waitq != NULLID) {
            // There are more waiters that
            // could be upgrading themselves.
            break;
        } else if (o.readers == 1) {
            // Only reader, grab it right away.
            n.wlocked = 1;
            n.readers = 0;
        } else {
            w.app_data = UPGRADE;
            n.waitq = w.id;
            n.readers--;
        }
    } while (!CAS(l, o, n));
    if (n.wlocked != 1) {
        w = tls_get_waiter();
        if (n.waitq != w.id) {
            if (!insert_for_upgrade(l, w)) {
                return FALSE; // failure
            }
            // Unlock and wait for lock
            // to be granted.
            unlock(l);
        }
        w.event.wait();
    }
    return TRUE; // success
}

bool_t insert_for_upgrade(lw_rwlock_t l,
                          waiter_t w) {
    while (TRUE) {
        ow = find_oldest_waiter(l);
        if (ow.app_data == UPGRADE) {
            // Someone else waiting for upgrade
            return FALSE;
        }
        // Try setting next pointer of last waiter.
        // Set the app_data first since it needs to
        // be visible to any competing thread that
        // is also trying the upgrade.
        w.app_data = UPGRADE;
        if (!CAS(ow.next, NULLID, w.id)) {
            // Competing upgrade or re-insert
            ow = id2waiter(ow.next);
            if (ow.app_data == UPGRADE) {
                // Lost to competing upgrade
                return FALSE; // failure
            } // else lost to competing re-insert
        } else return TRUE; // CAS success
    }
}



Fig. 7: Operations to downgrade and upgrade a lw_rwlock. The old and new values passed in to 
CAS are denoted by o and n, respectively. The caller's thread local waiter structure is denoted 
by w. We use ow to denote oldest waiter on the waitq (the head of the queue). 
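
The uncontended paths of Figure 7 can be written as CAS loops over a single 32-bit word. The C11 sketch below is a hedged illustration only. The bit layout (bit 0 for wlocked, bits 1 to 15 for readers, bits 16 to 31 for waitq) is an assumption, since the packing is not given here, and try_downgrade and try_upgrade are hypothetical helpers that cover just the case with no waiters. They return false wherever the pseudocode would fall through to its queue-based slow path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NULLID 0xFFFFu   /* assumed sentinel for "no waiter" */

/* Assumed packing of the 4-byte lock word (not confirmed by the paper):
 * bit 0 = wlocked, bits 1..15 = readers, bits 16..31 = waitq. */
typedef _Atomic uint32_t lw_rwlock_word_t;

static uint32_t pack(unsigned wlocked, unsigned readers, unsigned waitq) {
    return (wlocked & 1u) | ((readers & 0x7FFFu) << 1) | ((uint32_t)waitq << 16);
}
static unsigned wlocked_of(uint32_t v) { return v & 1u; }
static unsigned readers_of(uint32_t v) { return (v >> 1) & 0x7FFFu; }
static unsigned waitq_of(uint32_t v)   { return v >> 16; }

/* Fast path of downgrade(): writer becomes the single reader when nobody
 * waits. Returns false when waiters exist and the slow path is needed. */
static bool try_downgrade(lw_rwlock_word_t *l) {
    uint32_t o = atomic_load(l);
    for (;;) {
        assert(wlocked_of(o) == 1);              /* caller holds the write lock */
        if (waitq_of(o) != NULLID) return false;
        uint32_t n = pack(0, 1, NULLID);
        if (atomic_compare_exchange_weak(l, &o, n)) return true;
        /* CAS failed: o now holds the fresh value, so re-check and retry. */
    }
}

/* Fast path of upgrade(): the sole reader becomes the writer when nobody
 * waits. Returns false when the slow path is needed. */
static bool try_upgrade(lw_rwlock_word_t *l) {
    uint32_t o = atomic_load(l);
    for (;;) {
        if (waitq_of(o) != NULLID || readers_of(o) != 1) return false;
        uint32_t n = pack(1, 0, NULLID);
        if (atomic_compare_exchange_weak(l, &o, n)) return true;
    }
}
```

Keeping the lock bit, the reader count and the queue id in one word is what lets each transition be decided by a single compare-and-swap: any concurrent change to any of the three fields makes the exchange fail and forces a re-read.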



